Lab 5 - with solutions

Please note that there is a file on Canvas called Getting started with R which may be of some use. This provides details of setting up R and Rstudio on your own computer as well as providing an overview of inputting and importing various data files into R. This should mainly serve as a reminder.

Recall that we can clear the environment using rm(list=ls()) It is advisable to do this before attempting new questions if confusion may arise with variable names etc.

Example 1

In this example, we will investigate whether there is an association between personality and colour preference. This is an elaboration on Example 3.5 in the lecture notes using R. Here is the data:

Personality Red Yellow Green Blue Totals
Introvert 20 6 30 44 100
Extrovert 180 34 50 36 300
Totals 200 40 80 80 400

Follow the step-by-step guide below:

  • Firstly, we input the data and create a matrix.
R1<-c(20,6,30,44)
R2<-c(180,34,50,36)
PersonalityMatrix<-matrix(c(R1,R2),nrow=2,byrow=T,
      dimnames = list(c("Introvert", "Extrovert"),c("Red", "Yellow", "Green", "Blue")))
PersonalityMatrix
          Red Yellow Green Blue
Introvert  20      6    30   44
Extrovert 180     34    50   36
  • We next visualise the data using side-by-side bar charts.
barplot(PersonalityMatrix,
        beside=TRUE,
        ylim=c(0, 200),legend=T, col=c("yellow","blue"),
        xlab="Colour Preference",
        ylab="Frequency")

  • On the face of it, it looks like a higher proportion of Extroverts prefer the colour red, but we will have to now proceed with a statistical approach, i.e. a chi-square test of association using the following code:
result<-chisq.test(PersonalityMatrix)
result

    Pearson's Chi-squared test

data:  PersonalityMatrix
X-squared = 71.2, df = 3, p-value = 2.362e-15
  • In terms of the assumptions, we can check independence by inspection and we check the expected frequencies using the following code:
result$expected
          Red Yellow Green Blue
Introvert  50     10    20   20
Extrovert 150     30    60   60
  • The output shows no expected frequency is less than 5, satisfying the assumption.

  • Returning to the chi-square test output, we see that we have a significant result, i.e. we reject the null hypothesis that there is no association and conclude that there appears to be an association between personality and colour preference. Next we calculate the effect size. For this we need the “vcd” package in R, see below:

install.packages("vcd")
library(vcd)
assocstats(PersonalityMatrix)
                    X^2 df   P(> X^2)
Likelihood Ratio 70.066  3 4.1078e-15
Pearson          71.200  3 2.3315e-15

Phi-Coefficient   : NA 
Contingency Coeff.: 0.389 
Cramer's V        : 0.422 
  • Cramer’s V of 0.422 indicates a medium to strong effect. Finally, we perform post hoc testing using the standardized residuals approach.
install.packages("questionr")
library(questionr)
chisq.residuals(PersonalityMatrix)
            Red Yellow Green  Blue
Introvert -4.24  -1.26  2.24  5.37
Extrovert  2.45   0.73 -1.29 -3.10
  • We can think of the standardized residuals as \(z\)-scores, therefore, we look for \[ |\text{standardized residual}|>1.96 \] to be significant at the 5% level. Therefore, the values -3.1 and 2.45 are significant for Extrovert, which indicates that fewer extroverts prefer blue and more extroverts prefer red. The values 5.37 and -4.24 for Introvert are also significant, indicating that more introverts prefer blue, but fewer introverts prefer red.

Exercise 1

Investigate, checking the relevant assumptions, whether there is any relationship between Male and Female staff and Earnings based on the data provided by a local company below:

Male Female
High Earnings 37 39
Low Earnings 25 23

If there is an association investigate this.

Hint: Clearly this is a \(2\times2\) contingency table and hence we must use Yates’ continuity correction (checking assumptions). In R, this is performed using the correct=T option of the chi-square test, see below:

result2<-chisq.test(EarningsMatrix, correct=T)
result2

Solutions

  • Note that Yates’ continuity correction should be used in this example.

  • We first input the data and create a matrix.

Row3<-c(37,39)
Row4<-c(25,23)

EarningsMatrix<-matrix(c(Row3,Row4), byrow=T,nrow=2, dimnames=list(c("High","Low"),c("Male","Female")))
EarningsMatrix
     Male Female
High   37     39
Low    25     23
  • It is good practice to visualise the data.
barplot(EarningsMatrix,
        beside=TRUE,
        ylim=c(0, 50),legend=T, col=c("yellow","blue"),
        xlab="Earnings",
        ylab="Frequency")

  • Next we perform the Chi-square test of association
result2<-chisq.test(EarningsMatrix, correct=T)
result2

    Pearson's Chi-squared test with Yates' continuity correction

data:  EarningsMatrix
X-squared = 0.033991, df = 1, p-value = 0.8537
  • We should also check the expected frequencies assumption
result2$expected
     Male Female
High   38     38
Low    24     24
  • Note that no expected frequency is <5. Independence is also satisfied by visualisation.

  • The p-value is 0.8537>0.05, therefore we conclude that there does not appear to be a relationship between earnings and male/female staff.

Example 2

This example will look at using the linear-by-linear option when testing for association between ordinal variables (see Remark 3.3 in the lecture notes). Independence may be assumed, but check the expected frequencies assumption. We will investigate whether there is an association between education levels and where people live using the following fictitious dataset.

GCSE A-level Degree
Village 130 50 20
City 40 50 110

Follow the steps below in R.

  • Task: input the variables and create a matrix called EducationMatrix.

  • Check your matrix agains the output below:

        GCSE A-level Degree
Village  130      50     20
City      40      50    110
  • Next we visualise the data using bar charts:
barplot(EducationMatrix,
        beside=TRUE,
        ylim=c(0, 170),legend=T, col=c("yellow","blue"),
        xlab="Education",
        ylab="Frequency")

  • Next we perform the Chi-square test of association in order to check the expected frequencies assumption. (Independence is clear by inspection.)
result3<-chisq.test(EducationMatrix)
result3

    Pearson's Chi-squared test

data:  EducationMatrix
X-squared = 109.95, df = 2, p-value < 2.2e-16
result3$expected
        GCSE A-level Degree
Village   85      50     65
City      85      50     65
  • Note that no expected frequency is \(<5\). We now perform the linear-by-linear test (as we have an ordinal variable) using the code below. Note that we first need to covret the data into “table” form.
tab<-as.table(EducationMatrix)
tab
        GCSE A-level Degree
Village  130      50     20
City      40      50    110
library(coin)
result4<-lbl_test(tab)
result4

    Asymptotic Linear-by-Linear Association Test

data:  Var2 (ordered) by Var1 (Village, City)
Z = -10.449, p-value < 2.2e-16
alternative hypothesis: two.sided
  • This result is significant, hence there appears to be an association between location and educational levels. Next we calculate Cramer’s V:
library(vcd)
assocstats(EducationMatrix)
                    X^2 df P(> X^2)
Likelihood Ratio 118.76  2        0
Pearson          109.95  2        0

Phi-Coefficient   : NA 
Contingency Coeff.: 0.464 
Cramer's V        : 0.524 
  • As we have a significant result, we finally perform a post hoc test.
library(questionr)
chisq.residuals(EducationMatrix)
         GCSE A-level Degree
Village  4.88       0  -5.58
City    -4.88       0   5.58
  • This, together with the bar charts, indicates that more people in cities have a degree as their highest qualification, while more people in the villages have GCSEs as their highest qualification.
Back to top